Steinbiss etal. Mobile DNA 201 2, 3:1 8 
http://www.mobilednajournal.eom/content/3/1/1 



O 



Mobile 
DNA 



SOFTWARE Open Access 



LTRsift: a graphical user interface for semi- 
automatic classification and postprocessing 
of de novo detected LTR retrotransposons 

Sascha Steinbiss, Sascha Kastens and Stefan Kurtz* 



Abstract 

Background: Long terminal repeat (LTR) retrotransposons are a class of eukaryotic mobile elements characterized by 
a distinctive sequence similarity-based structure. Hence they are well suited for computational identification. Current 
software allows for a comprehensive genome-wide denovo detection of such elements. The obvious next step is the 
classification of newly detected candidates resulting in (super-)families. Such a denovo classification approach based 
on sequence-based clustering of transposon features has been proposed before, resulting in a preliminary 
assignment of candidates to families as a basis for subsequent manual refinement. However, such a classification 
workflow is typically split across a heterogeneous set of glue scripts and generic software (for example, spreadsheets), 
making it tedious for a human expert to inspect, curate and export the putative families produced by the workflow. 

Results: We have developed LTRsift , an interactive graphical software tool for semi-automatic postprocessing ofde 
novo predicted LTR retrotransposon annotations. Its user-friendly interface offers customizable filtering and 
classification functionality, displaying the putative candidate groups, their members and their internal structure in a 
hierarchical fashion. To ease manual work, it also supports graphical user interface-driven reassignment, splitting and 
further annotation of candidates. Export of grouped candidate sets in standard formats is possible. In two case studies, 
we demonstrate how LTRsift can be employed in the context of a genome-wide LTR retrotransposon survey effort. 

Conclusions: LTRsift is a useful and convenient tool for semi-automated classification of newly detected LTR 
retrotransposons based on their internal features. Its efficient implementation allows for convenient and seamless 
filtering and classification in an integrated environment. Developed for life scientists, it is helpful in postprocessing 
and refining the output of software for predicting LTR retrotransposons up to the stage of preparing full-length 
reference sequence libraries. The LTRsift software is freely available at http://www.zbh.uni-hamburg.de/LTRsift under 
an open-source license. 

Keywords: Software, ERV, LTR retrotransposons, Classification, Postprocessing, Prediction, Annotation, User interface 



Background 

Large portions of eukaryotic genomes are repetitive, that 
is, the sequences in question appear in more than one 
genomic location. Repetitive DNA (in the scope of this 
paper also referred to as repeats) can be further subdi- 
vided into a hierarchy of categories, the most general of 
which are simple repeats - for example, satellite DNA or 
telomeres - and interspersed repeats, for example, trans- 
posable elements (TE). Interspersed repeats are abundant 
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in the genomes of many higher organisms. For example, 
about 46% of the human genome [1] and 38% of the mouse 
genome [2] consists of interspersed repeats, most of which 
are transposons. These, in turn, are further characterized 
by their transposition mechanism: class I transposons or 
retrotransposons replicate via an RNA intermediate, while 
class II transposons or DNA transposons replicate via a 
DNA intermediate [3,4]. 

An important order of retrotransposons are LTR retro- 
transposons, which are less common in animals, but 
the predominant order in plants. Similar in structure to 
retroviruses (Figure 1), they show long terminal repeat 
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Figure 1 Structure of a typical long terminal repeat retrotransposon. Adapted from [5]. AP:Aspartic protease; IN: integrase; LTR: long terminal 
repeat; PPT: polypurine tract; RH: RNase H; RT: reverse transcriptase; TSD: target site duplication. The numbers below the illustration denote typical 
lengths of the respective component. This example is of the copio-Wke superfamily, as it shows the IN-RT-RH domain order. 



(LTR) sequences at their flanks, with adjacent small direct 
repeats resulting from target site integration (target site 
duplications or TSD). The sequence between the LTRs 
typically encodes polyprotein genes, for example, gag and 
pol, containing structural proteins for an intracellular 
replication compartment as well as enzymatic functions 
required to perform the reverse transcription and inte- 
gration process. This includes an aspartic protease (AP), 
reverse transcriptase (RT), RNase H (RH), and integrase 
(IN) function. 

Additionally, several other structural elements are 
required for successful transposition. These include the 
primer binding site (PBS), which is essential as the start 
point for reverse transcription [6-8]. Typically the PBS is 
8 to 18 base pairs long and expected to be found directly 
downstream of the inner 5' LTR boundary. In this region 
of 8 to 18 nucleotides, it is also highly complementary 
to the 3 f region of a transfer RNA of the host organism. 
Another important feature is the polypurine tract (PPT), 
needed as a primer for plus-strand DNA synthesis. The 
A/G-rich polypurine tracts vary in length and are usually 
in the range of 8 to 22 bases [8]. Often a U-rich section 
(the so-called U-box) can be found just upstream of the 
PPT [9]. 

The presence of these distinctive structural features 
have led to the development of various software tools 
using these features as markers to identify potential LTR 
retrotransposon insertions (called candidates in the scope 
of this paper) [10-14]. These tools do not use any external 
reference sequence, an approach called de novo identifi- 
cation. The rationale behind this approach is that transpo- 
son sequences are typically species-specific and a purely 
homology-based identification approach is not guaran- 
teed to be successful. Instead, for transposon annotation 
in newly sequenced genomes with no or only few related 
and annotated genomes, a de novo approach is required. 
Some tools also detect internal features of the candidates 
and exploit their occurrence to improve the candidate 
detection results [13] or output the feature annotations for 
further analysis [5]. 

For about a decade, de novo software tools have been 
in use, delivering useful results in a variety of detailed 
studies covering LTR retrotransposons in insect [5,15,16], 
crustacean [17], mammalian [18,19], avian [20], meta- 
zoan [21] and plant [22,23] genomes. In these studies, the 



prediction is followed by several postprocessing steps to 
clean up the result set and to infer additional informa- 
tion from the predicted candidate sequences and features. 
For example, it is reasonable to separate all candidates 
from the analysis set which do not satisfy a given set of 
rules (for example, presence of significant open reading 
frames or profile hidden Markov models (pHMM) domain 
hits) to identify and discard potential false positives 
[5,12,21]. 

Furthermore, it is desirable to classify each candidate 
according to a hierarchical schema consisting of classes, 
subclasses, orders, superfamilies and - on the lowest level 
- individual families. We will use these terms as denned 
in the classification scheme proposed by Wicker and col- 
leagues [4], This schema has been widely accepted despite 
some initial discussions regarding its consistency [24] and 
originality [25] . 

It has to be noted that in this paper we will focus on 
the order of LTR retrotransposons from the retrotranspo- 
son class (class I), as such elements are the result of the de 
novo identification tools. However, if more general repeat 
detection approaches are used, a preclassification on a 
higher level is possible using existing software, for exam- 
ple, TEclass [26] on the class level, or REPCLASS [27] up 
to the superfamily level. 

In the LTR retrotransposon order, several superfami- 
lies have been established, for most of which membership 
can be determined by the order of protein domains in the 
coding region. For example, copia-kke elements (IN-RT- 
RH configuration) are distinct from gypsy-like elements 
(RT-RH-IN configuration). Thus if protein domain loca- 
tions are known, subfamily assignment is straightforward. 
If protein domain locations are not known, pHMM- 
based approaches using superfamily-specific reference 
sequences have proven to be successful [21]. 

A more fine-grained classification groups the candi- 
dates into putative families. Most studies perform the 
family classification using either sequence-based cluster- 
ing of the inter-LTR region of the predicted candidates 
[15-17] according to fixed similarity thresholds (for exam- 
ple, the 80-80-80 rule [4]), by inferring family membership 
from phylogenies [19,20,28] or by using graph clustering 
algorithms based on stochastic flow [21], a method that 
does not require any arbitrary similarity threshold. Other 
methods - though not used in an LTR retrotransposon 
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context - follow an aggregation procedure based on pair- 
wise distances [29]. Another approach is to obtain sets 
of similarity-based clusters per feature (LTR, PBS, PPT, 
protein domains and so on), then to combine them adher- 
ing to the principle of cluster compatibility [5], arriving 
at final family assignments. This approach makes use of 
biologically relevant sequence information only and is 
expected to work for families containing candidates with 
partially deleted or mutated internal regions. A case study 
of this method on Drosophila melanogaster candidates 
predicted by LTRharv 'est [14] and annotated using LTRdi- 
gest [5] showed that this approach recovers the majority of 
the known families in this species, with well-reproduced 
reference sequences [5]. 

However, putting this family classification scheme into 
practice is rather complex and time-consuming. A possi- 
ble approach would be to augment the tabular file output 
of LTRdigest with cluster number data using third-party 
matching and/or clustering tools (for example, Vmatch 
[30]), then open the augmented table in a spreadsheet soft- 
ware (for example, Microsoft Excel or OpenOmce Calc) 
and to iteratively sort the candidates by cluster num- 
ber, inspecting candidates with identical cluster number 
sets and marking compatible groups as putative fami- 
lies in the process. Removal of uninteresting candidates 
or false positives is in this case only a matter of delet- 
ing the corresponding rows in the spreadsheet. While 
this manual approach is feasible (but tedious) for smaller 
non-mammalian genomes (713 candidates and 25 features 
in the D. melanogaster example from [5]), it does not 
scale too well with growing candidate sets. This is due 
to both the long run times of sorting rows in such soft- 
ware and the presentation and/or visualization of results 
getting more and more difficult to follow with increasing 
candidate and family numbers. Another important issue 
is the reproducibility of such a manual approach when 
prediction runs are repeated, for example with modified 
parameters. 

Since the classification as proposed is an algorithmically 
well-defined problem, an alternative would be a purely 
automatic software implementation of the approach, for 
instance using a scripting language. However, a human 
expert often would like to inspect and improve the result- 
ing family assignments, not only to get an impression of 
how the individual families look, but also to verify that no 
two families were inadvertently joined, or that one family 
could be split up into two at a second glance. Later steps 
then would include matching the sequence of a family 
representative to a set of reference sequences to identify 
relatives to known families, or to create multiple sequence 
alignments (MSA) of family members determining varia- 
tions across family members. As an alternative to MSAs, 
the identification and analysis of units called modules has 
been proposed [31]. 



The final step, after filtering and classification, is 
the preparation of a species-specific reference sequence 
library of full-length LTR retrotransposons. Consisting 
of one representative sequence per putative family, this 
library can then be used as a basis for homology- 
based identification and classification of incomplete, non- 
autonomous insertions or solo LTRs in the whole genome 
sequence. This can be done using well-established soft- 
ware, for example, based on RepeatMasker [32]. 

In this manuscript we propose a de novo analysis 
approach of LTR retrotransposons relying on a com- 
bination of automatic and manually guided interactive 
processing of a given dataset. For this approach a com- 
prehensive workbench for LTR retrotransposon candidate 
postprocessing with the following features is needed: 

• intuitive display of a candidate set (for example, 
unclassified candidates or putative families) and its 
properties in a flexible, concise table representation 

• hierarchical display of detected features for each 
candidate, including a linear diagram illustrating the 
feature locations and orientations 

• convenient maintenance of putative families and 
their members via a drag-and-drop interface 

• flexible, extensible filtering and reassignment of 
candidates based on simple, annotation-based rules 

• assisted assignment of candidates to putative families, 
based on feature sequence matches, single-linkage 
clustering and joining of compatible candidates 

• automatic selection of candidates suitable for a 
reference library 

• automatic annotation of candidates with matches to a 
reference library 

• input and output of library sequences and 
annotations in standard output formats (GFF3 [33] 
and FASTA) to ensure interoperability with external 
preprocessing and/or postprocessing software. 

To address this need we have developed LTRsift, an open 
source graphical software tool implementing these fea- 
tures. As LTRsift is based on a larger software suite that 
also includes LTRharvest [14] and LTRdigest [5], a typical 
and complete use case will likely include all three tools. 

This paper is structured as follows: after familiariz- 
ing the reader with the interface and usage of LTRsift, 
we present two case studies showing how the software 
can assist a researcher in de novo analyses of complete 
genomes. First, we perform the Drosophila melanogaster 
analysis from the LTRdigest paper [5], exemplifying the 
use of the software. In a second use case, we briefly 
describe an analysis from scratch using LTRsift on a mam- 
malian genome, specifically the gray short-tailed opossum 
(Monodelphis domestica) genome. We show that LTRsift 
scales for such larger data sets yielding some interesting 
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results. We give a comprehensive discussion of the result 
and finally conclude with some remarks on the usability of 
the software. 

Implementation 

This section describes the use of the LTRsift software and 
the design of its user interface. We explain the import 
of input data into the software and the initial prepro- 
cessing. Then we describe the components of the LTRsift 
window and show what information they depict and how 
they allow curation of the candidate sets. We show how to 
define filtering rules, possibly incorporating all the infor- 
mation given in the annotation file. Finally, we give some 
examples for such rules. 

Working with data 

All data handled by LTRsift are organized as projects. 
A project is a collection of data, consisting of the can- 
didate annotation, the sequence set the annotations are 
based upon (given as an external encoded sequence file 
in the GtEncseq format [34] that is produced by soft- 
ware coming with LTRsift), and various metadata (such 
as the currently open tabs, parameter sets for classifica- 
tion/matching, and so on). A user can create a new project 
by selecting the 'New' entry from the 'File' menu. A series 
of dialog windows then guides the user through the steps 
of specifying all components of the project. The first step 
is to specify a project (file) name as well as the initial 
annotation and sequence inputs. This is done by choos- 
ing the corresponding GFF3 and GtEncseq files from disk. 
The GFF3 annotations have to satisfy the following basic 
requirements: 

• candidates must be of the Sequence Ontology [35] 
type 'LTR_retrotransposon, 

• additional features (for example, protein-match', 
primer _binding_site', 'RR.tract') must be children of 
this root type, and 

• the GFF3 sequence identifier for all features must 
start with seqX', where X is the sequence number 
(O-based) of the corresponding sequence in the 
associated genome, for example, seq3' references the 
fourth sequence in the encoded input index. 

The GFF3 output produced by LTRharvest and LTRdi- 
gest satisfies these requirements. More information about 
the input data formats can be found in Additional file 1, 
Section 1.2. 

If the user chooses to compute the matches required 
for the clustering process, then the second step is param- 
eterization of the matching and clustering parameters. 
LTRsift utilizes the sensitive and efficient sequence com- 
parison software LAST [36] to calculate matches if feature 
sequences are longer than 80 nucleotides on average. 



If the user has chosen to perform the automatic clas- 
sification at this point, the next step is to specify which 
features should be used as the basis of classification, that 
is, for which features in the annotation cluster numbers 
should be compared. In addition, a prefix for automati- 
cally assigned family names can be given. Since the notion 
of most complete candidates depends on the deviation of 
candidates from the group median in terms of LTR and 
whole-candidate length [5], deviation thresholds can be 
stated as well. All settings can be reviewed and corrected 
if necessary before starting the actual import and prepro- 
cessing run. This process finally delivers a new project 
file, which can later be opened at any time using the 
'Open entry from the 'File' menu. This allows the user to 
continue a previously interrupted session. 

User interface 

After opening a project file, its content appears in the 
main window of LTRsift. The main window is subdi- 
vided into four main components (Figure 2). The family 
list (Figure 2a) shows the current list of putative fami- 
lies, identified by their names. Names are automatically 
assigned whenever a new family is created by the classifi- 
cation algorithm. However, a family name can be edited at 
any time. 

A double click on a family opens a new tab holding the 
candidate list comprising the family (Figure 2b). Each tab, 
labeled with the family name, shows all member candi- 
dates of a given family, with columns specifying sequence, 
strand, location, LTR and element lengths, and cluster 
numbers for all detected features according to a color 
code. Colors are user-configurable via a style definition 
file. By default LTR, g^g-associated domains, AP domains, 
RT domains, RH domains and IN domains, as well as 
PPT and PBS are displayed in different colors. Candi- 
dates which could not be placed unambiguously in one 
of the families remain in an unclassified' tab, as do can- 
didates with no cluster numbers (singletons). The rows 
making up the candidate list can be sorted according to 
the values in any column. Moreover, individual columns 
can be hidden to improve legibility on screens with low 
horizontal resolution. Candidates can be moved from one 
family to another by dragging and dropping the respec- 
tive row in the candidate list. Whenever candidates are 
deleted from families, they are added to a project- wide list 
of unclassified candidates. Candidates deleted from the 
list of unclassified candidates will be removed from the 
project entirely. 

Clicking on an entry in the candidate list displays addi- 
tional detailed information. In particular, a hierarchical 
tree representation of the candidate features (Figure 2c) 
and a linear visualization of the candidate and its com- 
ponents (Figure 2d) are displayed. The latter depicts the 
candidate together with its genomic location, most likely 
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Figure 2 Screenshot of the LTRsift main window, (a) Putative family list, (b) candidate list, (c) candidate details, (d) candidate visualization. The 
currently loaded project contains 1 3, 943 candidates from the Monodelphis domestico genome, with the currently selected candidate showing a full 
set of detected features (PBS, PPT, protein domains). ORF detection and reference matching have been performed. Additional details, such as PPT 
and PBS sequences, Pfam IDs and so on, are available by scrolling to the right in (c). The graphical representation in (d) depicts the retrotransposon 
(red) with PPT and PBS as small lines in the two tracks below. The next track shows protein domain matches, coded in different colors. Here 
integrase domains are depicted in blue, reverse transcriptase domains in red, protease domains in purple, RNase H domains in gold, and any other 
domains in green. The RNase H domain is marked in red because it has been selected in the candidate detail list. The reference match in the track 
below (shown in yellow) spans the interior region of the candidate completely, suggesting that it likely is a full-length element. The bottom track 
shows open reading frames in blue color. LTR: long terminal repeat; ORF: open reading frame; PBS: primer binding site; PPT: polypurine tract; TSD: 
target site duplication. 



reading direction and internal features, intuitively spread 
out into separate tracks reminiscent of the representation 
used in a genome browser such as Ensembl [37]. 

Augmenting annotations with additional data 

LTRsift does not only allow displaying results of the auto- 
matic classification, but can also perform additional oper- 
ations which add extra information to the annotations 
stored in the project. One possible augmentation con- 
sists of detecting the longest open reading frames (ORFs) 
inside candidates. The ORFs are added to the candi- 
date annotations as 'reading-frame' features. For cross- 
referencing the de novo results with custom sequence 
data, LTRsift also allows users to match candidates against 



external reference sequences by calling BLAST [38] as an 
external matching tool. The parameter sets used for all 
matching runs are numbered and stored in the project 
both for documentation purposes and for simplifying 
multiple runs with varying but similar parameters. LTRsift 
allows flexible filtering of candidate sets, based on filter- 
ing rules defined in a simple programming language (see 
next section for more details). Candidates which do or 
do not satisfy the filtering conditions (or boolean combi- 
nation thereof) can be either unclassified or moved to a 
new family entry created on the fly. These postprocess- 
ing steps can either be performed on all candidates, on 
all candidates in a family, or on an arbitrary selection 
of candidates. 
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Filtering rule definitions 

Besides the described presentation of candidate sets, LTR- 
sift allows for flexible filtering of candidates based on 
their internal structure or external data. This is done by 
evaluating each candidate with regard to a given chain 
of annotation-based rules. Each rule specifies one aspect 
of the candidate which is satisfied or not, expressed 
via a boolean value {true or false). We call these rules 
predicates. For example, the presence of detected pro- 
tein domains is a predicate, as is the property of being 
Copia-kke (or Gypsy-like, for that matter). The predicate 
is captured in a rule returning the appropriate boolean 
value. 

When a filtering run is started, candidates are selected 
according to rules, keeping a candidate in its original fam- 
ily if the evaluation returns the value false. If it returns 
the value true, the desired behavior can be selected by 
the user. Either the candidate in question is taken out 
of a putative family and put back into the unclassified 
set. This means that filtered candidates already in the 
unclassified set are deleted from the project altogether. 
Alternatively the candidate in question can be assigned 
to a separate family, newly created for candidates which 
were filtered out in this run. Which one of these options 
is used can be selected in the filtering dialog (Figure 3). 
Rules are chainable: if more than one rule is given, LTR- 
sift allows their combination by requiring that all of them 
must be true to filter out a candidate (boolean AND), 
or that it suffices to have one of them evaluate to true 
(boolean OR). 

The rule is stored in a text file which is interpreted by 
the software when added to the project. Each filtering rule 
contains a set of metadata, such as its author contact, 
description, and version to ease distributability and repro- 
ducability of results (see Additional file 1, Section 1 for 
more details). 

As the filtering rules are not built into LTRsift itself, but 
rather described in Lua, a powerful but simple program- 
ming language [39], the filtering functionality is extensible 
which gives a user a maximum amount of flexibility. We 
will now show example filtering rules to illustrate how 
annotation data are accessed. 

Filtering by protein domain presence 

As the protein domain coding sequences are the main 
basis for the matching and clustering steps leading to fam- 
ily assignment, a common task is to remove all candidates 
which do not contain at least one domain hit. 

This task can be solved by a rule implemented in a 
function named filter. This function has access to a 
representation of each candidate in the form of a directed 
acyclic graph in which nodes represent individual features 
(for example, LTR, TSD, PBS, ...) and edges represent 
'part of relationships between such features. The latter 



indicate, for instance, that an LTR is part of an LTR retro- 
transposon, which in turn is part of a repeated region. 

The actual rule defined in the programming language 
Lua looks as follows: 

function filter (candidate_node) 

gfi = gt . f eature_node_iterator_new 

( candidate_node ) 
node = gfi : next () 
while not (node == nil) do 

if (node : get_type ( ) = = 
"protein_match" ) then 

return false 

end 

node = gfi mext ( ) 

end 

return true 

end 

The rule evaluates whether the candidate contains a 
protein.match' feature. The traversal stops and returns 
false once a node with the type protein_match' is found. 
If no such node is found, true is returned. This is the case 
when node becomes nil, indicating that all children have 
been examined without breaking the loop. 

Besides the type, the following additional data are stored 
in each node and can conveniently be queried from a 
filtering rule: 

• the sequence region (for example, chromosome, 
contig, and so on) the feature is located on, 

• location of the feature on that region in terms of start 
and end position (1-based), 

• strand (forward/reverse), 

• a numeric score value (the meaning of which depends 
on the tool which produced this value, for example, 
an E-value), 

• a set of key-value pairs containing arbitrary named 
attributes (for example, feature name/ID, anticodon 
for PBS-binding tRNAs, Pfam ID for matching 
pHMMs, and so on). 

Filtering by match coverage 

Another use case for the filtering component is to separate 
candidates with high local sequence similarity to a refer- 
ence sequence set from those that do not contain such 
similarities. To give an example, a filtering rule which fil- 
ters out all candidates not matching a reference sequence 
over at least 80% of their length is easily implemented 
(Figure 4). 

In the first step (lines 2 to 11), the length of the can- 
didate is calculated from the start and end positions of 
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• LTRsift - Filter 



[ £y>Add j Available filter rule files ^ Selected filter rule files Invert? 

Full Protein Domain Set 1.0 Full Protein Domain Set 1.0 

™ Remove Filters out candidates without full protein domain set \^\ Filters out candidates without full protein domain set 

oNew Protein Domain Filter 1.0 

{ — > Filters out candidates without protein domains 

( .4? Edit ) Reference Match Filter 1.0 

Selects candidates with long reference matches 



, • if all filters match (logical AND) 

Action to perform with filtered candidates: Unclassify/Delete Z 

\ | C if any filter matches (logical OR) 

Filter rule metadata 

Author 

Sascha Steinbiss 
Author email: 

steinbiss(a>zbh.uni-hamburg.de 
Description: 

Filters out a candidate if it does not contain at least one match for each RT, PR, INT and RH. 
[ Run filtering on 382 candidates ] | Cancel j 

Figure 3 Screenshot of the LTRsift filter selection dialog. The left side of the dialog shows the filtering rules added to the project and available 
to be used. The right side shows the filtering rules to be applied in the current filtering run. The checkbox next to each rule allows the user to negate 
it. This dialog is set to unclassify or delete all candidates not passing the filtering step - in this case, this means all candidates that do not contain 
protein domains and no long reference matches. The buttons on the left allow adding rules to the project and removing them again. Moreover 
rules can be edited directly from within LTRsift in a simple built-in text editor, avoiding the need to locate and open them in a separate text editor. 
Clicking the button on the lower left starts the filtering process. 



the node in this connected component with the type 
'LTR.retrotransposon! We only store the length of the last 
occurrence of such a node because in a valid annotation 
there is only one such node per connected component. 
The length is stored in the variable candidate_length. 
If no "LTR_retrotransposon" node was found, we are not 



looking at an LTR retrotransposon element (lines 12 to 
14). This case can occur when filtering annotation files 
in which LTR retrotransposon annotations occur besides 
other annotations, e.g. gene annotations, which this rule 
is designed to ignore. The second step then iterates over 
all features again, comparing the lengths of the matches 







1 


function filter (candidate_node) 




candidate_length = 0 


3 


gfi = gt . f eature_node_iterator_new (candidate_node) 




node = gfi: next () 


5 


while not (node == nil) do 


6 


if (node : get_type ( ) == "LTR_retrotransposon" ) then 


7 


range = node : get_range ( ) 


8 


candidate_length = range : get_end ( ) - range : get_start ( ) + 1 


9 


end 


10 


node = gfi: next () 


11 


end 


12 


if (candidate_length == 0) then 


13 


return true 


14 


end 


15 
16 


gfi = gt . f eature_node_iterator_new (candidate_node) 


17 


node = gfi: next () 


18 


while not (node == nil) do 


19 


if (node : get_type ( ) == "nucleot ide_match" ) then 


20 


mrange = node : get_range ( ) 


21 


if mrange : get_end ( ) - mrange : get_start ( ) + 1 > candidate_length * 0.8 then 


22 


return false 


23 


end 


24 


end 


25 


node = gfi: next () 


26 


end 


27 


return true 


28 


end 


Figure 4 Source code (in the programming language Lua) of the filtering rule for selecting/filtering candidates according to reference 


match coverage. The function computes the lengths of the candidate and reference matches contained in the candidate. If the length of at least 


one reference match exceeds 80% of the candidate length, the function returns false, otherwise true. 
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with 80% of the candidate length (line 21) and returns the 
appropriate value in the process (lines 16 to 27). 

Additional command-line tools 

For very large data sets or for scripted processing (for 
example, in an automated pipeline), the filtering and clus- 
tering functionality is also accessible using command-line 
tools, which are part of the GenomeTools package. 

Data export 

Candidate sequences (in FASTA format) and candidate 
annotations (in GFF3 format) can be exported to files. 
LTRsift supports exporting all candidates in the whole 
project into a single file as well as exporting the mem- 
bers of one or more specific families into one or multiple 
separate files. This makes it easy to prepare sequence sets 
for subsequent external analysis, for example, multiple 
sequence alignment. 

Software requirements 

LTRsift is intended for use on UNIX-like operating sys- 
tems, like Linux, Mac OS X or the BSD family of operating 
systems. The GUI is built upon the GTK+ version 2 toolkit 
for creating graphical user interfaces [40,41], a wide- 
spread library component which is already being used as 
the basis of many popular free software packages like the 
GIMP image editor [42] or the GNOME desktop environ- 
ment [43]. GTK+ is included in the majority of currently 
available Linux distributions. 

The necessary components for parsing and handling 
both sequence and annotation data are provided by 
the GenomeTools genome analysis library [44], freely 
available from http://genometools.org. For drawing the 
schematic illustration of each candidate, the AnnotationS- 
ketch engine [45] is used, which is based on the Cairo 2D 
vector graphics engine [46] for rendering. Again, Cairo is 
a standard library very likely to be present on a recent 
graphical UNIX or Linux machine. It can be installed on 
Mac OS X using one of several package managers, for 
example, Fink [47] . 

As an external component to compute initial matches 
for clustering, the LAST software [36] is used and must 
be installed on the system. Likewise, for reference match- 
ing the BLAST software (version 2.2 and higher) must be 
present in order to use this LTRsift feature. 

Also, we provide a statically linked version of LTR- 
sift for the Linux platform which does not require any 
pre-installed version of the GenomeTools shared library. 

Results 

To illustrate the use of LTRsift in de novo LTR retrotrans- 
poson analysis efforts, we performed example analyses of 
the Drosophila melanogaster (fruit fly) and Monodelphis 
domestica (gray short- tailed opossum) genomes. 



Results of the Drosophila melanogaster analysis 

We utilized LTRsift to semi-manually curate and 
postprocess putative LTR retrotransposon families 
and potential full-length members in the Drosophila 
melanogaster release 5.8 genome [GenBank:AE014134, 
GenBank:CM000456, GenBank:AE013599, GenBanlcCM 
000457, GenBank:AE014296, GenBank:CM000458, Gen 
Bank:AE014297, GenBank:CM000459, GenBanlcAEOl 
4135, GenBank:AE014298, GenBank:CM000460] from 
scratch, detected using LTRharvest and LTRdigest These 
were parameterized with range, similarity and offset 
constraints for the detection of LTR pairs, and with 
tRNA sequences as well as profile HMMs to annotate 
the internal region. These parameters were identical to 
the parameters used in previous publications [5,14]. We 
refer to Additional file 1 for more details on the param- 
eterization of these two external tools. For both the 
initial detection and the subsequent LTRsift analysis steps 
described below, an Intel Core 2 Duo system (2.4GHz, 
4GB RAM, Ubuntu Linux 12.04) was used. 

The encoded genome sequence and the sorted GFF3 file 
were added to the project using the guided project cre- 
ation dialog. Then matching and clustering of all feature 
sequences was performed in a matter of minutes, keep- 
ing the default parameters in the appropriate dialogs, with 
the exception that the query step size (LAST parame- 
ter -k) was increased to 10 to speed up the process. We 
required matches to span 80% of the shorter sequence and 
30% of the longer sequence. For the classification step, all 
detected features were used as evidence in the group join- 
ing process by selecting all of them in the classification 
dialog. As deviation thresholds for putative full-length 
candidate detection, an LTR length deviation of 50 bp 
and a candidate length deviation of 200 bp were set. No 
filtering was done at this time. 

After creating the project, the initial classification 
resulted in 359 candidates being split up into 48 puta- 
tive families, appearing in the putative family list. The 
remaining 354 (49.6%) remained unclassified. Afterwards 
we discarded all families with less than three members 
by using an automatic LTRsift feature which selects such 
families and offers to delete them. Of the original 48 fam- 
ilies, 37 remained. In the next step, each of the putative 
families was individually opened in a new tab and their 
members were inspected one after the other by examin- 
ing the data displayed in the candidate detail list and the 
linear visualization. Special attention was paid to the can- 
didate length and the length of the LTRs they contain 
as well as their feature composition. Candidates which 
obviously lacked features present in the majority of the 
members of their family were discarded by deleting them 
in LTRsift, placing them back into the group of unclassified 
candidates. This was particularly the case when a length 
aberration coincided with the loss of a common protein 
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domain. This situation was regarded as evidence for a 
deletion inside the candidate in question. After inspecting 
all groups in this way, we had a closer look at the pre- 
viously unclassified candidates by opening the respective 
tab and manually joined another 134 of them into four 
additional groups, again looking at their details. In addi- 
tion to cluster compatibility, this joining was based again 
on element and LTR length as well as specific features such 
as the presence of ORFs and their lengths. For this man- 
ual approach, new empty families were created using the 
LTRsift GUI and the respective member candidates were 
moved into them using the drag-and-drop functionality. 

To evaluate which existing families our de novo iden- 
tified families correspond to, we used the reference 
sequence matching function in LTRsift to compare all can- 
didate sequences with the reference sequence set of LTR 
retrotransposons used in our previous work [5], contain- 
ing 61 sequences. After matching the reference sequences 
to the candidates from within LTRsift using BLASTN 
(E-value threshold: 0.01), the results are displayed as an 
extra track in the candidate visualization. That is, in the 
delivered image the matched candidate region is cov- 
ered by a separate feature, while the label of the feature 
describes the matched region on the reference. We exam- 
ined our putative families for consistency according to the 
match features and assigned names which incorporate the 
recovered reference family (for example, the group corre- 
sponding to the existing mdg3 family was called clmeLl 
(mdg3)'). 

With little effort, we were able to recover 28 of the 61 
known families as denned by the reference sequence set 
(Table 1). In one case (the family named 472), we obtained 
two automatically derived putative families for one refer- 
ence family due to the LTR sequences of the candidates 
ending up in different clusters, breaking their compati- 
bility. The roo family proved to be a difficult family to 
recover due to the presence of only short protein domain 
matches within their members. However, joining them 
on the basis of the LTR sequences alone allowed us to 
obtain a putative family of 94 candidates, though not all 
of them appeared to be full-length. Another group diffi- 
cult to separate was a group of elements belonging to the 
Stalker I Stalker2l Stalker3T I Stalker4 families. Members of 
all of these were clustered into one large family, possibly 
due to the high similarity of their coding sequence. 

Seven additional putative families were found which 
could not be uniquely matched to any reference fam- 
ily. The majority of these consist only of three to five 
candidates with widely varying element and LTR lengths. 
These candidates are linked only by one protein domain 
or LTR sequence match and show short ORFs only. In 
some of them, spurious short matches to the Dm88 and 
GATE families were found in non-coding areas of these 
candidates. 
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Table 1 Results for the Drosophila melanogaster use case 



Assigned family 


Reference family 


Number of candidates 


dmelJ 


mdg3 


12 


dmel_3 


opus 


18 


dmel_4 


copia 


24 


dmel_5 


springer 


6 


dmel_6 


Burdock 


16 


dmel_7 


diver 


8 


dmel_8 


HMS-Beagle 


10 


dmel_9 


Tirant 


19 


dmelJO 


Tabor 


4 


dmel_11 


Quasimodo 


14 


dmel_12 


Transpac 


9 


dmel_14 


flea 


16 


dmel_17 


invader2 


8 


dmel_17_2 


invader3 


7 


dmel_18 


Max-element 


4 


dmel_22 


3S18 


6 


dmel_24 


McClintock 


4 


dmelJ.6 


Stalker 


25 


dmel_32 


17.6 


18 


dmel_33 


412 


17 


dmel_34 


412 


9 


dmel_36 


Idefix 


5 


dmel_39 


rover 


5 


dmel_46 


micropia 


3 


newfam_0 


blood 


25 


newfam_29 


HMS-Beagle2 


4 


newfam_40 


297 


20 


manual 


gypsy4 


7 


manual 


mdgl 


17 


manual 


roo 


94 



This table lists the putative families as assigned during our semi-automatic 
evaluation run on the Drosophila melanogaster genome (left column). The center 
column shows the name of the known family represented by that putative 
family, obtained from matching of the candidate sequence against a reference 
sequence set. The rightmost column lists the number of candidates in the 
respective family. The dmelJ.6 group (matched to various Stalker sequences) 
was not counted as recovered due to the multitude of non-unique matches to 
multiple references. Families with the newfam prefix were obtained by 
re-running the classification algorithm on subsets of the unclassified candidate 
set. Finally, families marked as manual were derived non-automatically. 

We have included the annotation files generated by 
LTRsift for this analysis as Additional file 2. 

Results of the M. domestical analysis 

By contrast, the goal of this analysis was to confirm that 
the LTRsift software can handle input data of a scale likely 
to be produced in de novo LTR retrotransposon prediction 
efforts in large genomes, for example, those of mammals. 
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As an example dataset, we applied LTRharvest and LTRdi- 
gest to the genome of M. domestica, the gray short-tailed 
opossum [48]. It is estimated that 10% of its approxi- 
mately 3.5 gigabase genome is comprised of endogenous 
retroviruses (ERV) [49,50]. This has been detected in a 
homology-based approach searching for protein-coding 
sequences known from the Repbase database [51], but not 
incorporating the LTRs as structural features for de novo 
detection. 

The M. domestica sequence (Broad Institute assem- 
bly version MonDom5) [GenBank:CM000368, GenBank: 
CM000369, GenBank:CM000370, GenBank:CM000371, 
GenBank:CM000372, GenBank:CM000373, GenBank: 
CM000374, GenBank:CM000375, GenBank:CM000376] 
was downloaded from the Ensembl website [52]. 

The LTRharvest and LTRdigest runs and the subsequent 
LTRsift analysis were performed on the same Linux stan- 
dard desktop system equipped with the same hardware as 
described above. We used a slightly different set of param- 
eters than in the Drosophila case because some peculiari- 
ties in the fly genome (such as unusual PBS-tRNA binding 
offsets) are not known for the opossum genome. In addi- 
tion, we used some extra protein domain pHMMs suitable 
for mammals (see Additional file 1, Section 3). 

The LTRharvest prediction with default parameters 
(see Additional file 1, Section 3) delivered 58,684 candi- 
dates. We then utilized the command-line filtering tool to 
remove all candidates which did not contain any protein 
domain hits, reducing the number of candidates to 13,944. 

This set of candidates was then loaded into an LTRsift 
project and preprocessed. The matches used for clus- 
tering were computed with the same settings as in the 
Drosophila case (LAST option -k 10, matches need to 
span 80% of the shorter sequence and 30% of the longer 
sequence). Full-length member deviation thresholds were 
identical as well. 

Matching and joining of the 13,944 candidates took 
about 30 minutes. As a result of the initial classi- 
fication on the basis of the parameters above, 171 
putative families were created containing 2,015 candi- 
dates altogether. There were initially 11,929 candidates 
that were unclassified, and 76 of the 171 initial puta- 
tive families contained only two members and were 
again discarded using the respective LTRsift function- 
ality. This left 95 remaining families comprising 1,863 
candidates. 

A look at the 95 remaining putative families reveals 
that the distribution of candidates across the families is 
skewed: 937 of the 1,863 candidates are in two puta- 
tive families, one with 722 members and one with 215 
members. The candidates in the smaller one are linked 
via their LTR and RT clusters only. The linear visual- 
ization in LTRsift shows that the location of the RT 
matches widely varies across the family members. Only 



very short ORFs are present. This may suggest that this 
family of 215 members is composed of non-autonomous 
candidates with many mutations. By contrast, the major- 
ity of the members of the second, larger putative family 
of 722 members contain a full set of protein domain 
hits - that is, protease, RT, RH and IN domains. In 
most candidates, hits to a protein domain associated with 
Gag were found as well. The candidates are consistently 
linked on the basis of these protein domain clusters, as 
well as their LTR sequence clusters. ORFs are predom- 
inantly longer (up to thousands of bases). We matched 
these candidates to known M. domestica ERV sequences 
downloaded from Repbase using the LTRsift reference 
matching functionality, resulting in partial and full-length 
matches to the ERV2JAD, ERV37MD and ERV11JAD 
reference sequences. The other, smaller families, contain- 
ing between 3 and 74 members, in many cases showed 
full sets of protein domains without being covered by 
a reference match, suggesting that there may be poten- 
tial for previously undiscovered or unclassified elements. 
Some of these protein domains also contain other rel- 
evant protein domain hits, such as other Gag domains 
(Gag_p30, Gag_MA and Gag_p24) or a potential Env/coat 
polyprotein (TLV_coat). 

To assess the possible number of yet undetected full- 
length candidates, we prepared a filter rule selecting only 
those candidates with a full set of protein domains. That 
is, the protease, RT, IN and RH functions must all be 
represented with at least one pHMM hit associated with 
that function. We used this rule to select matching can- 
didates from the whole candidate set using LTRsift. This 
delivered a new group containing 1,009 candidates pass- 
ing the filter. Afterwards we used the reference match 
coverage filter with threshold 80% to weed out those can- 
didates among the 1,009 that were already matched to 
a known reference sequence. As a result, we only found 
159 candidates with a full set of protein domain hits 
which have a match to a reference sequence over at least 
80% of their internal sequence or more, leaving 850 still 
unmatched and interesting for further analysis. This illus- 
trates that LTRsift allows a user to conveniently prepare 
interesting subgroups of candidates on the basis of their 
features. 

Discussion 

The advantages of having a specific graphical applica- 
tion for classification, postprocessing and curation of LTR 
retrotransposon candidates are obvious when compared 
with a purely manual approach. No data conversion is nec- 
essary when using de novo candidate detection tools that 
are able to output the increasingly common GFF3 format 
to represent their results. LTRsift is intended to be used 
with the LTRharvest and LTRdigest software that satisfy 
this requirement. 
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Once input data are imported into a project file, the 
project and all associated data are loaded quickly, even 
when the number of candidates is in the tens of thousands, 
as may occur when analyzing large mammalian genomes. 
The window configuration of the user interface, includ- 
ing all open tabs, is saved with the project, allowing the 
user to continue previously interrupted work in the same 
environment. LTRsift displays candidates and their family 
memberships in an intuitive way, using common GUI con- 
cepts like drag-and-drop to support manipulation. A user 
familiar with web-based genome browsers can intuitively 
understand the visual candidate representation. Features 
are shown in color-coded tracks and are labeled with addi- 
tional information, such as domain names, match targets, 
or reading frame and orientation. The view is extensively 
configurable using a style file, making it possible to adjust 
both colors and other layout options such as font sizes, 
as well as enable or disable the display of specific feature 
types. 

Filtering rules are powerful while easy to write with 
basic programming skills. They support extensive access 
to the candidate annotation and can be combined to form 
more complex conditions, not only allowing the user to 
discard candidates but also to add them to new subgroups 
that do or do not satisfy the condition described by the fil- 
tering rules. Another advantage is that filtering rules are 
self-contained in one text file per rule. This makes it possi- 
ble to distribute user-defined filtering rules in the research 
community. 

The matching, clustering and classification components 
used in LTRsift have been designed to be modular, making 
it possible to quickly add support for new matching tools, 
clustering strategies or classification algorithms. A unified 
representation of the candidates as graphs allows the same 
for other tasks working on annotations and sequences, 
creating new annotations in the process. 

The currently implemented ORF detection and refer- 
ence annotation components are good examples for such 
tasks, employing GenomeTools functionality and third- 
party tools like BLAST to transparently extend the anno- 
tation within the graphical interface. 

We are not aware of any other tool for this exact task 
of supporting postprocessing and curation of de novo LTR 
retrotransposon annotations in a fashion similar to LTR- 
sift There is a graphical tool, VisualRepbase [53], available 
to display occurrences of TEs, for example, taken from 
the Repbase Update database [51] in a genomic context 
together with annotations, for example, downloaded from 
the NCBI databases. However, VisualRepbase does not 
support the formation of families inside the database, nei- 
ther does it display the internal structure of the displayed 
elements in terms of features. By contrast, LTRsift does 
not take the genomic neighborhood of the candidates into 
account. 



The classification approach as described in [5] works 
well in the Drosophila genome. Nevertheless, a large vari- 
ety of clustering and family assignment strategies exists, 
and it would be a natural assumption that other, more 
sophisticated approaches may work better on candidates 
from other genomes. The modular architecture of LTR- 
sift allows the user to incorporate alternative classification 
strategies in the future, creating a comprehensive and flex- 
ible solution for the integration of tools among the diverse 
landscape of classification methods. 

A basic requirement of our classification approach is the 
presence of annotated internal feature sequences, whose 
similarities are used to separate candidates into putative 
families. Many of the candidates satisfy this requirement. 
Unfortunately, in the M. domestica data set no features 
were detected in the majority of the candidates. Con- 
sequently, these candidates can be either false positives 
or non-autonomous copies. While non-autonomous ele- 
ments are indisputably important in general, prior anal- 
yses have shown that such de novo predicted candidates 
are often unreliable and may well be false positives, as we 
demonstrated in an earlier use case [5]. However, there are 
also non-autonomous elements which still retain internal 
features, which can readily be processed using LTRsift and 
included in a reference sequence set. This set could then 
be used as a starting point for homology-based detection 
of more truncated copies in the genome. 

World Wide Web-based solutions are becoming 
increasingly popular for interactive and sometimes dis- 
tributed analysis of structured data sets due to their 
platform independence on the user side - only a web 
browser is needed to access the data from any loca- 
tion with a network connection. We did not follow a 
web-based approach for the design and development of 
LTRsift, The main reason is that the size of the underlying 
genome sequences may well become too large to be con- 
veniently uploaded to a web server when analyzing large, 
for example, mammalian, genomes. By contrast, the size 
of the annotations is of moderate size (approximately 42 
MB for the full unfiltered M. domestica candidate set). 
The sequence is required to perform sequence-based 
analyses like reference matching, ORF finding or simply 
to display short motif sequences like PBS and PPT in 
the candidate details. Hence uploading the annotation 
alone would not suffice to display every interesting bit of 
information on the candidates. Instead, we chose to build 
LTRsift on an open source GUI platform intended to run 
on freely available desktop operating systems. Sharing 
project files, for example, via a shared network drive, 
allows a distributed annotation. 

While parameterization of the filtering rules is cur- 
rently only possible by editing the rule files directly, a 
useful improvement would be the definition of param- 
eter sets, which can then directly be set in the filter 



Steinbiss etal. Mobile DN A 2012, 3:18 
http://www.mobilednajournal.eom/content/3/1/18 



Page 12 of 13 



selection window. This would also allow the use of multi- 
ple instances of the same filter with different parameters 
without having to copy the file. Another desirable feature 
would be support for multiple levels of group member- 
ship per candidate. For example, a candidate may appear 
in multiple groups, which in this context describe the 
candidate on multiple levels: it may appear once in its spe- 
cific family, and once in a more general 'copia-like' group. 
This could be implemented by assigning tags to candi- 
dates and allowing queries on the tags to group candidates 
together. While functionality to export sequences already 
exists, it could be very useful to be able to start exter- 
nal tools on given candidate sets (for example, selected 
candidates or all members of a putative family). A typical 
example for this kind of use would be a multiple sequence 
alignment tool such as ClustalX [54] or others. Finally, on 
some occasions, exporting the linear candidate visualiza- 
tion as an image file would be desirable. Such functional- 
ity could easily be implemented as the AnnotationSketch 
engine supports output in a variety of vector and bitmap 
formats [45]. 

Conclusions 

We have developed LTRsift, a software tool for visualiza- 
tion and postprocessing of de novo predicted LTR retro- 
transposon annotations. It literally allows the user to sift' 
through a possibly large quantity of results from a pre- 
diction and annotation software like LTRharvest and/or 
LTRdigest, which it was designed to work with. How- 
ever, it relies on standard data formats and can also work 
on results from other tools, given that the input data 
are appropriately formatted. LTR retrotransposons can be 
assigned both automatically and manually to groups con- 
sidered putative families, which can then serve as a basis 
for comprehensive sequence library preparation. To the 
best of our knowledge, LTRsift is the first software for this 
specific task, implementing not only classification but also 
flexible, customizable filtering in a graphical environment. 
Relying on a common GUI toolkit from the open source 
world, the user interface is familiar to everyday computer 
users. LTRsift is efficient enough to allow work with large 
datasets consisting of up to tens of thousands of candi- 
dates on standard desktop hardware, making it likely to 
be used by life scientists preferring a visual, exploratory 
hands-on approach to dealing with result data. 

Availability and requirements 

• Project name: LTRsift 

• Project home page: 

http://www.zbh.uni-hamburg.de/LTRsift 

• Operating system(s): UNIX-like systems, for 
example, Linux or Mac OS X 

• Programming language: C (GUI software), Lua 
(filtering rules) 



• Other requirements: GenomeTools version 1.4.2 
and higher (not required by the static version), GTK+ 
2.2.4 and higher 

• License: GPL2 

Additional files 



Additional file 1 : Technical information. This PDF file contains 
additional information on how to write filtering rules for the LTRsift 
software. Besides a description of the rule file structure, it also contains a 
documentation of the functions to access the representation of the 
candidate annotation and a documentation of the command line tools. 
Finally, it lists parameterization details for the example runs. 

Additional file 2: Example annotation for D. melanogaster. This 
gzipped tar archive contains the annotation GFF3 file created as a result of 
our evaluation runs for the D. melanogaster genome, as well as the 
corresponding sequence in FASTA format. 
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